Implement BloomFilter query rewrite (without pushdown optimization) #248

dai-chen
Collaborator

@dai-chen dai-chen commented Feb 8, 2024

Description

Implements the BloomFilter skipping index query rewrite by introducing a new BloomFilterMightContain expression. This internal expression represents BloomFilter queries, following the approach a previous PR took when it added BloomFilterAgg. Because the Flint data source does not have pushdown optimization, this PR also updates the integration tests to validate both the code generation and the eval execution paths.
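
For readers unfamiliar with the idea, the following is a minimal, self-contained sketch of how a might-contain test can prune source files for an equality predicate. It is not the code in this PR: the class `FileSkippingSketch`, its methods, the two fixed hash seeds, and the file names are all hypothetical, and a plain `BitSet` stands in for the real bloom filter.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of bloom-filter file skipping; names are not from the PR.
final class FileSkippingSketch {
  static final int SIZE = 256;

  // One small filter per source file, standing in for the skipping index rows.
  private final Map<String, BitSet> filterByFile = new LinkedHashMap<>();

  // Record the column values seen in a file (the role an aggregate like BloomFilterAgg plays).
  void index(String file, String... values) {
    BitSet bits = filterByFile.computeIfAbsent(file, f -> new BitSet(SIZE));
    for (String value : values) {
      bits.set(slot(value, 17));
      bits.set(slot(value, 31));
    }
  }

  // Rewritten form of `WHERE col = value`: keep only files whose filter might contain the value.
  List<String> filesToScan(String value) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, BitSet> entry : filterByFile.entrySet()) {
      BitSet bits = entry.getValue();
      if (bits.get(slot(value, 17)) && bits.get(slot(value, 31))) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  // Toy hash: maps a value and a seed to a bit position in [0, SIZE).
  private static int slot(String value, int seed) {
    return Math.floorMod(value.hashCode() * seed + seed, SIZE);
  }
}
```

A file that contains the value is always returned (no false negatives); a file that does not contain it may still be returned occasionally because of hash collisions, which only costs an unnecessary scan.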

PR Planned

Documentation

https://github.com/dai-chen/opensearch-spark/blob/implement-bloom-filter-query-rewrite-no-pushdown/docs/index.md#feature-highlights

Issues Resolved

#206

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen added enhancement New feature or request 0.2 labels Feb 8, 2024
@dai-chen dai-chen self-assigned this Feb 8, 2024
…down

Signed-off-by: Chen Dai <daichen@amazon.com>
@@ -132,16 +132,23 @@ public void writeTo(OutputStream out) throws IOException {
    * @param in input stream
    * @return bloom filter
    */
-  public static BloomFilter readFrom(InputStream in) throws IOException {
-    DataInputStream dis = new DataInputStream(in);
+  public static BloomFilter readFrom(InputStream in) {
Collaborator Author

Need to wrap this in try-catch because Spark codegen doesn't allow checked exceptions.
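
A minimal sketch of that pattern, assuming the checked `IOException` is rethrown as an unchecked one. The diff above does not show which exception the PR actually throws, so `UncheckedIOException` is an assumption here, and `SketchFilter` with its fields is a hypothetical stand-in for the real `BloomFilter` class.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a serializable filter; not the PR's BloomFilter class.
final class SketchFilter {
  final int numHashFunctions;
  final long[] bits;

  SketchFilter(int numHashFunctions, long[] bits) {
    this.numHashFunctions = numHashFunctions;
    this.bits = bits;
  }

  // Serializes to a byte array, converting any IOException to an unchecked one.
  byte[] toBytes() {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream dos = new DataOutputStream(buffer)) {
      dos.writeInt(numHashFunctions);
      dos.writeInt(bits.length);
      for (long word : bits) {
        dos.writeLong(word);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  // No `throws` clause: callers (such as generated code) never see a checked exception.
  static SketchFilter readFrom(InputStream in) {
    try {
      DataInputStream dis = new DataInputStream(in);
      int numHashFunctions = dis.readInt();
      long[] bits = new long[dis.readInt()];
      for (int i = 0; i < bits.length; i++) {
        bits[i] = dis.readLong();
      }
      return new SketchFilter(numHashFunctions, bits);
    } catch (IOException e) {
      // Assumption: rethrow unchecked so the failure still surfaces to the caller.
      throw new UncheckedIOException(e);
    }
  }
}
```

With this shape, a call such as `SketchFilter.readFrom(stream)` can be emitted into a method body that declares no checked exceptions, while a truncated or corrupt stream still fails loudly at runtime.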

@dai-chen dai-chen marked this pull request as ready for review February 9, 2024 20:58
@dai-chen dai-chen added 0.3 and removed 0.2 labels Feb 28, 2024
override def eval(input: InternalRow): Any = {
  val value = valueExpression.eval(input)
  if (value == null) {
    null
Collaborator

Why is the eval result null here? Should bloomFilter.test(null) return false instead?

Collaborator Author

Following Spark SQL NULL semantics, NULL is ignored in BloomFilterAgg. So NULL is returned for bloom_filter_might_contain(clientip, NULL). Reference: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala#L100
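
The NULL behavior described in this comment can be sketched in plain Java, using a boxed `Boolean` where `null` plays the role of SQL NULL. This is an illustration under stated assumptions, not the PR's Scala expression: the class `NullAwareMightContain`, its methods, and the toy hash function are hypothetical.

```java
import java.util.BitSet;

// Hypothetical illustration of NULL handling; not the PR's BloomFilterMightContain expression.
final class NullAwareMightContain {
  static final int SIZE = 1024;
  private final BitSet bits = new BitSet(SIZE);

  // Aggregation side: NULL inputs are ignored, so they never set any bit.
  void putNullable(Long value) {
    if (value == null) {
      return;
    }
    bits.set(slot(value, 1));
    bits.set(slot(value, 2));
  }

  // Query side: a NULL probe value yields NULL (unknown), not false.
  Boolean mightContain(Long value) {
    if (value == null) {
      return null;
    }
    return bits.get(slot(value, 1)) && bits.get(slot(value, 2));
  }

  // A WHERE clause keeps a row only when the predicate is exactly TRUE.
  static boolean keepsRow(Boolean predicate) {
    return Boolean.TRUE.equals(predicate);
  }

  // Toy hash: mixes the value with a seed and maps it to a bit position in [0, SIZE).
  private static int slot(long value, int seed) {
    long h = value * 0x9E3779B97F4A7C15L + seed * 0xC2B2AE3D27D4EB4FL;
    h ^= (h >>> 32);
    return (int) Math.floorMod(h, (long) SIZE);
  }
}
```

In this sketch a NULL result and a false result have the same effect inside a filter, since `keepsRow` rejects both; the difference only matters when the result is combined with other expressions under three-valued logic.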

Collaborator Author

As I understand it, the case discussed here only happens with WHERE clientip = NULL. The concern is that it gets rewritten to bloom_filter_might_contain(clientip, NULL), which would skip source files by mistake.

I ran some tests and found that col = NULL is optimized away by Spark directly, because it always returns an empty result:

spark-sql> EXPLAIN SELECT `@timestamp`, request FROM ds_tables.http_logs WHERE clientip = null;
== Physical Plan ==
LocalTableScan <empty>, [@timestamp#5, request#7]

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen force-pushed the implement-bloom-filter-query-rewrite-no-pushdown branch from a81ae4d to ce21393 Compare March 6, 2024 21:18
@dai-chen dai-chen merged commit d3cdb0e into opensearch-project:main Mar 7, 2024
4 checks passed
@dai-chen dai-chen deleted the implement-bloom-filter-query-rewrite-no-pushdown branch March 7, 2024 19:16